Translating XBRL Into Description Logic. An Approach Using Protege, Sesame & OWL
Authors
Abstract
In the context of the eTen project WINS, a web-based business intelligence service for public and private financial institutions has been designed and implemented. One of the goals of the project was to provide new financial knowledge on companies from information gathered through interoperable information services. The services were implemented under the new emerging standard XBRL, used for financial reporting. We sketch how relevant financial information was extracted from annual financial reports, and we show the limitations we encountered with the XBRL schema, due to the lack of reasoning support over XML-based data and over information extracted from documents. To overcome these difficulties, we describe the “ontologization” of XBRL, which we consider a necessary prerequisite for large intelligent web-based financial information and decision support systems.

1 General Background

In the context of the eTen project WINS, a web-based business intelligence service for public and private financial institutions has been designed and implemented. One of the goals of the project was to provide new financial knowledge on companies from information gathered through interoperable information services [1]. The services were implemented under the new emerging standard called XBRL (eXtensible Business Reporting Language), used for financial reporting. In the following, we sketch how relevant financial information was extracted from annual financial reports. We also show the limitations we encountered with the XBRL schema, due to the lack of reasoning support over XML-based data and over information extracted from documents that is finally mapped onto XBRL instances.
In the first part of this paper, we summarize our approach to information extraction guided by XBRL; afterwards, we describe our work dedicated to the ontologization of XBRL, which we consider a necessary prerequisite for large intelligent web-based financial information and decision support systems. The work described here will be carried further within an Integrated Project of the 6th Framework Programme, called MUSING (MUlti-Industry, Semantic-based next generation business Intelligence).

1 eTEN 2003/1, grant agreement no. C51083. WINS stands for ‘Web-based Intelligence for common-interest fiscal Networked Services’.
2 For more information, see XBRL International: http://www.xbrl.org.

2 Incremental Information Extraction Guided by XBRL

The next three subsections present the basic setting, viz., XBRL-guided information extraction from structured and unstructured documents.

2.1 Knowledge-Driven Information Extraction from Structured and Unstructured Documents

The actual input for the information extraction (IE) task in WINS consists of balance sheets in PDF format, containing structured forms (tables) and free text (included, for example, in the annexes of balance sheets). The relevant information extracted from these sources is merged and mapped onto the XBRL format. A terminological clarification is in order here. IE often refers to the task of filling user- or application-defined templates with information detected by natural language analysis tools in textual documents. For certain applications, knowledge bases are available that support the IE task; such knowledge might consist of taxonomies, thesauri, or ontologies. In this case, knowledge-driven IE tries to populate knowledge bases with instances detected in the textual documents. This was the situation in WINS, where the XBRL taxonomy guided the IE task through the analysis of both tabular data and free text.
In the end, an XBRL structure should be instantiated with the information extracted from the annual reports of companies.

2.2 The Mapping Process from Text to XBRL

The mapping process has been implemented within a Web service made available to the WINS partners. The Web service operates on PDF/text files from WINS data providers and returns files containing the data in XBRL format. In a first step, text and tables are extracted from the PDF documents. It was also necessary to approximately reconstruct the original layout, which is lost in the PDF-to-text conversion. Once this has been done, the WINS information extraction module inspects the generated HTML documents, trying to find correspondences in the text of the tables for labels of concepts contained in the overall XBRL taxonomy. But not only the detection of realizations of XBRL concepts in the document is important. The extraction tools must also detect relevant dates in the tables as well as the currencies used, so that the figures contained in the tables, e.g., balance and profit & loss (P&L) tables, receive their correct interpretation. Since the XBRL taxonomy also covers information about a company as such (name, address, number of employees, etc.), the extraction tools need to detect this kind of information as well. To obtain it, we implemented a simple named entity recognition algorithm for detecting names of companies, locations, and relevant persons.

2.3 Incremental Process

Even though the automatic process of mapping structured data into XBRL is not perfectly accurate, it already brings a significant efficiency improvement in providing XBRL data for various applications, such as self-assessment-based company rating. But more is necessary. Aggregation of information is needed, consisting, for example, in adding to the XBRL document information that is not included in the tables.
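The label-matching step described above can be illustrated with a minimal sketch in Python. The taxonomy excerpt, the concept identifiers, and the matching cutoff are hypothetical; the actual WINS module operates over a far larger taxonomy and is more elaborate, but the principle of resolving table-cell text against taxonomy labels is the same:

```python
import difflib

# Hypothetical excerpt of the taxonomy (concept id -> German label); the real
# XBRL taxonomy used in WINS contains thousands of such concepts.
TAXONOMY = {
    "de-gaap:Umsatzerloese": "Umsatzerlöse",
    "de-gaap:Materialaufwand": "Materialaufwand",
    "de-gaap:Jahresueberschuss": "Jahresüberschuss",
}

def match_cell(cell_text, cutoff=0.8):
    """Map a table-cell string onto an XBRL concept via fuzzy label matching,
    which tolerates small OCR and layout-reconstruction errors."""
    by_label = {label: cid for cid, label in TAXONOMY.items()}
    hits = difflib.get_close_matches(cell_text.strip(), by_label, n=1, cutoff=cutoff)
    return by_label[hits[0]] if hits else None

print(match_cell("Umsatzerlöse"))    # exact label hit
print(match_cell("Umsatzerloese"))   # near miss, still resolved
print(match_cell("Total assets"))    # no correspondence -> None
```

Fuzzy rather than exact matching is the key design choice here: table labels recovered from a PDF-to-text conversion frequently differ from the canonical taxonomy labels by an umlaut transcription or a stray character.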
As a first step, the annexes present in the annual report of a company are analyzed and the information relevant for the XBRL document is extracted. This is where linguistic analysis and text mining come into play, since the information is no longer available in structured form (positions in a table), but is given as free text. Let us give an example of incremental processing from an annual report. In a P&L table, a position is reserved for Umsatzerlöse (trading profit) with a sum of 159,356 K Euro. The PDF-to-XBRL converter detects this fact and generates the corresponding XBRL code. But the table contains a reference to a section in the annex, and there we find, in free text, that part of the Umsatzerlöse was earned abroad: Von den Umsatzerlösen wurden 78.299 Tsd. Euro (Vorjahr 1.653 Tsd. Euro) im Ausland erzielt. This is a case where text analysis is clearly needed to extract the information, since a simple pattern would not suffice for the extraction task. The knowledge base (here the XBRL taxonomy) tells us that Umsatzerlöse in this portion of text is a relevant term. Hence, this information is extracted and combined with the information gained from the table. Some further facts are also needed, which can be considered soft facts; for instance, changes in the management structure of a company, analysis of trends in a particular branch, information that is present in documents external to the annual report, etc. This information will be mapped onto XBRL if there are terms in the taxonomy corresponding to it. Other relevant information needs to be encoded in a usable way, so that additional relations between soft and hard facts can be stated or inferred. In this case, the incremental building of XBRL-encoded information is not trivial at all, since documents need to be considered that have been written over a longer period of time.
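The combination of a hard fact from the P&L table with a refining fact from the annex can be sketched with a simplified data model. The class and field names, as well as the reporting period, are hypothetical and not the actual WINS representation; the point is only that both facts end up grouped under the same XBRL concept:

```python
from dataclasses import dataclass

# Simplified data model (names hypothetical): a "hard" fact from a P&L table
# and a refining "soft" fact from the annex are merged under one XBRL concept.
@dataclass
class Fact:
    concept: str         # XBRL concept the fact instantiates
    value: int           # amount in thousands of Euro
    period: str          # reporting period
    source: str          # "table" (hard fact) or "annex" (refining fact)
    qualifier: str = ""  # e.g. "abroad" for the foreign share of the revenues

def merge(facts):
    """Group facts by (concept, period); the table fact supplies the total,
    annex facts attach qualified partial amounts to it."""
    merged = {}
    for f in facts:
        rec = merged.setdefault((f.concept, f.period), {"total": None, "parts": {}})
        if f.source == "table":
            rec["total"] = f.value
        else:
            rec["parts"][f.qualifier] = f.value
    return merged

facts = [
    Fact("Umsatzerloese", 159356, "2004", "table"),           # from the P&L table
    Fact("Umsatzerloese", 78299, "2004", "annex", "abroad"),  # from the annex text
]
record = merge(facts)[("Umsatzerloese", "2004")]
```

After merging, `record["total"]` holds the tabular figure and `record["parts"]["abroad"]` the refinement extracted from the free text.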
There is a need here for a more principled way of merging information, based on temporal reasoning (see, e.g., [2]). Therefore, we opt for a porting of XBRL to a representation language that supports the detection of relations not explicitly defined in XBRL. The next section describes this ongoing work, which we have recently started.

3 Translating XBRL Taxonomies into OWL

In this section, we report on our effort in translating XBRL taxonomies into an instance of description logic, viz., OWL, the Web Ontology Language [3]. We first introduce the basic tools and then move on to the translation process.

3.1 OWL, Protégé, and Sesame

XBRL taxonomies make use of XML Schema [4] in order to describe the structure of an XBRL document as well as to define new datatypes and properties relevant to XBRL. Given such a schema and a validation program, it is then possible to check whether a concrete (business) document conforms to the syntactic structure defined in the schema. As we have already indicated above, we probably need languages and tools that go beyond the expressive syntactic power of XML Schema. OWL, the Web Ontology Language, is the new emerging language for the Semantic Web that originates from the DAML+OIL standardisation. OWL still makes use of constructs from RDF [5] and RDFS [6], such as rdf:resource, rdfs:subClassOf, or rdfs:domain, but its two important variants, OWL Lite and OWL DL, restrict the expressive power of RDFS, thereby ensuring decidability. What makes OWL unique (as compared to RDFS or even XML Schema) is the fact that it can describe resources in more detail and that it comes with a well-defined model-theoretic semantics, inherited from description logic [7]. From description logic, OWL inherits further modelling constructs, such as intersectionOf, equivalentClass, or cardinality restrictions.
The description logic background furthermore provides automated reasoning support, such as consistency checking of the TBox and the ABox, subsumption checking during instance retrieval, etc.3 The XBRL OWL base taxonomy was manually developed using the OWL plugin of the Protégé knowledge base editor [8]. This version of Protégé comes with partial OWL Lite support. The latest version of XBRL, together with the Accounting Principles for Germany (our example, see below), consists of 2,414 concepts, 34 properties, and 4,780 instances. Overall, this translates into 24,395 unique RDF triples. The basic idea behind our effort was that even though we are developing an XBRL taxonomy in OWL using Protégé, the information stored on disk is still RDF on the syntactic level. We were thus interested in RDF database systems that make sense of the semantics of OWL and RDFS constructs such as rdfs:subClassOf or owl:equivalentClass. We are currently experimenting with the Sesame open-source middleware framework for storing and retrieving RDF data [9]. Sesame partially supports the semantics of RDFS and OWL constructs via entailment rules that compute “missing” RDF triples (the deductive closure) in a forward-chaining style at compile time. Since sets of RDF statements represent RDF graphs, querying information in an RDF framework means specifying path expressions.

3 TBox and ABox are terms introduced in the early days of description logic (or terminological logic; [?]). TBox refers to the terminological knowledge, i.e., knowledge about concepts that are relevant to our domain; for example, that sharesItemType is a subclass (or subconcept) of shares. In that sense, a TBox defines a domain schema. An ABox, in contrast, represents assertions about individuals (of certain concepts), for instance, that t genInfo.doc.id is related to t genInfo.doc via the partOf property. Nowadays, the term ontology usually refers to both the TBox and the ABox.
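The subsumption-based instance retrieval mentioned above can be illustrated with a small pure-Python sketch. The TBox echoes the sharesItemType example; the individuals and their identifiers are made up for illustration:

```python
# TBox: subconcept -> superconcept, echoing the sharesItemType example.
SUBCLASS = {
    "sharesItemType": "shares",
    "shares": "monetaryItem",
}
# ABox: individual -> its asserted (most specific) concept; ids are made up.
ABOX = {
    "pos_17": "sharesItemType",
    "pos_42": "monetaryItem",
}

def superclasses(concept):
    """Reflexive-transitive closure of rdfs:subClassOf for one concept."""
    chain = [concept]
    while concept in SUBCLASS:
        concept = SUBCLASS[concept]
        chain.append(concept)
    return chain

def retrieve(concept):
    """Instance retrieval with subsumption: an individual counts as an instance
    of every concept that subsumes its asserted concept."""
    return sorted(i for i, c in ABOX.items() if concept in superclasses(c))

print(retrieve("shares"))        # pos_17 is retrieved via subsumption
print(retrieve("monetaryItem"))  # both individuals are retrieved
```

Without the subsumption step, a query for shares would miss pos_17, whose asserted concept is only the more specific sharesItemType; this is exactly the kind of implicit knowledge that plain XML Schema validation cannot surface.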
Sesame comes with a very powerful query language, SeRQL, which includes (i) generalised path expressions, (ii) a restricted form of disjunction through optional matching, (iii) existential quantification over predicates, and (iv) Boolean constraints. From an RDF point of view, an additional 62,598 triples were generated through Sesame’s (incomplete) forward-chaining inference mechanism. Let us give an example of a slightly simplified entailment rule that computes the missing triples with hasPart in predicate position (see also figure 1):

(x, hasPart, y), (y, hasPart, z) → (x, hasPart, z)

Since we have classified hasPart (as well as partOf) as a transitive OWL property, this rule will fire, making implicit knowledge explicit and producing new triples (see figure 1).
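A forward-chaining computation of such a transitivity rule can be sketched in a few lines of Python. The triples below are illustrative and not taken from the actual XBRL data; the fixpoint loop mirrors what Sesame's entailment rules do for an owl:TransitiveProperty:

```python
def close_transitive(triples, prop):
    """Repeatedly fire (x p y), (y p z) => (x p z) until no new triple appears,
    i.e., compute the deductive closure for one transitive property."""
    triples = set(triples)
    changed = True
    while changed:
        changed = False
        for (s1, p1, o1) in list(triples):
            for (s2, p2, o2) in list(triples):
                if p1 == p2 == prop and o1 == s2:
                    new = (s1, prop, o2)
                    if new not in triples:
                        triples.add(new)
                        changed = True
    return triples

# Illustrative part-whole statements over balance-sheet positions.
explicit = {
    ("balanceSheet", "hasPart", "assets"),
    ("assets", "hasPart", "currentAssets"),
    ("currentAssets", "hasPart", "cash"),
}
closure = close_transitive(explicit, "hasPart")
# the implicit ("balanceSheet", "hasPart", "cash") is now an explicit triple
```

Production systems compute this closure far more efficiently than the naive quadratic loop above, but the principle is the same: implicit part-whole knowledge is materialised as ordinary triples, which plain XPath-style querying over the XML serialisation could never return.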
Similar resources
Supporting Early Adoption of OWL 1.1 with Protege-OWL and FaCT++
This paper describes integrated tools support for OWL 1.1 in the form of the FaCT++ Description Logic reasoner and the Protégé-OWL ontology editor. Challenges of designing and implementing OWL 1.1 reasoning algorithms are highlighted, and an outline of an OWL 1.1 API and editing environment is provided.
Representing Financial Reports on the Semantic Web: A Faithful Translation from XBRL to OWL
We discuss a translation of financial reports from the XBRL format into the Semantic Web language OWL. Different from existing approaches that perform a mechanical translation from XBRL’s XML schema into OWL, our approach can faithfully preserve the implicit semantics in XBRL and enable the logic model of financial reports. We show that such a translation reduces the risk of redundancy and inconsistency, an...
Analysis of XBRL documents representing financial statements using Semantic Web Technologies
This paper presents an approach to analyze XBRL documents using semantic web technologies. The system takes an XBRL document and converts its information into RDF files. The obtained files are merged with an OWL ontology describing financial information domain. The system enables the formulation of SPARQL queries over the generated data which facilitate the analysis of the financial information...
Modeling OWL with Rules: The ROWL Protege Plugin
In our experience, some ontology users find it much easier to convey logical statements using rules rather than OWL (or description logic) axioms. Based on recent theoretical developments on transformations between rules and description logics, we develop ROWL, a Protégé plugin that allows users to enter OWL axioms by way of rules; the plugin then automatically converts these rules into OWL DL ...
Expanded Abstract for Protégé Workshop, Jul 6-9, 2004: Enterprise Vocabulary Development in Protégé/OWL: Workflow and Concept History Requirements
The National Cancer Institute has developed the NCI Thesaurus, a biomedical vocabulary that provides consistent, unambiguous codes and definitions for concepts used in cancer research. It currently has about 34,000 concepts, in 20 hierarchies. A lexical component provides human usable definition and other information. Description logic is used to ensure complete, consistent and non-redundant de...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
Publication date: 2006